I use functions provided by the creator of this Kaggle dataset:
COVID-19 dataset.
It uses the data provided by https://www.worldometers.info/ and https://github.com/CSSEGISandData/COVID-19.
Some countries have no WHO region assigned. I fill these in from the other dataframes and remove the countries that still have no region afterwards.
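A minimal sketch of this fill-and-drop step, assuming pandas. The toy tables and the column names `Country/Region` and `WHO Region` are assumptions for illustration, not the notebook's actual data.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values and column names).
worldometer = pd.DataFrame({
    "Country/Region": ["Poland", "Brazil", "Atlantis"],
    "WHO Region": ["Europe", None, None],
})
lookup = pd.DataFrame({
    "Country/Region": ["Poland", "Brazil"],
    "WHO Region": ["Europe", "Americas"],
})

# Build a country -> region mapping from the second table.
region_by_country = (
    lookup.drop_duplicates("Country/Region")
    .set_index("Country/Region")["WHO Region"]
)

# Fill only the missing regions, keeping the ones already present.
filled = worldometer.copy()
filled["WHO Region"] = filled["WHO Region"].fillna(
    filled["Country/Region"].map(region_by_country)
)

# Drop the countries that still have no region.
filled = filled.dropna(subset=["WHO Region"]).reset_index(drop=True)
```

Filling before dropping keeps as many countries as possible; only rows that no other table can label are removed.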
We can see that for some countries we have no data about the tests, resulting in TotalTests=0.
There are 183 countries taken into account.
I used this part in the Sankey plot visualization. After normalising for each region (the Americas would otherwise dominate the others), I realised that the sum of Active, Deaths and Recoveries is not equal to Total for Europe. Therefore, I checked which countries have people missing.
This is really concerning. Over 1M cases have gone missing somewhere along the way.
I checked the min and max values in the dataframe. It turned out that some values are negative. As that does not make sense, I decided to treat the sign as a typo and keep the absolute values.
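The range check and the sign fix can be sketched as follows, assuming pandas. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Toy frame with two obviously wrong negative counts (hypothetical data).
df = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "NewCases": [12, -3],
    "NewDeaths": [-1, 0],
})

# Inspect the range of every numeric column.
numeric_cols = df.select_dtypes(include="number").columns
extremes = df[numeric_cols].agg(["min", "max"])

# Treat negative counts as sign typos and keep their absolute values.
df[numeric_cols] = df[numeric_cols].abs()
```

Taking absolute values is only one possible repair; it assumes the magnitude is right and only the sign is wrong.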
I make a merged dataframe with the consecutive WHO regions - this way I fill in the worldometer data.
I want to have uniform color coding throughout this file. Therefore, I sort this dataframe by the worldometer region order.
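One way to sort by a fixed region order is an ordered categorical, sketched here with pandas. The order in `region_order` and the toy rows are assumptions, not the order used in this notebook.

```python
import pandas as pd

# Hypothetical fixed order that every plot should follow.
region_order = ["Americas", "Europe", "Africa"]

df = pd.DataFrame({
    "Country/Region": ["Kenya", "Poland", "Brazil"],
    "WHO Region": ["Africa", "Europe", "Americas"],
})

# An ordered categorical sorts by the given order instead of alphabetically.
df["WHO Region"] = pd.Categorical(
    df["WHO Region"], categories=region_order, ordered=True
)
df = df.sort_values("WHO Region").reset_index(drop=True)
```

With the order fixed in one place, the n-th palette color always corresponds to the same region.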
I check where the total number of cases does not match the sum of its parts.
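A sketch of this consistency check, assuming pandas and worldometer-style column names (`TotalCases`, `ActiveCases`, `TotalDeaths`, `TotalRecovered`); the numbers are made up.

```python
import pandas as pd

# Toy data: country B has 5 cases that are unaccounted for.
df = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "TotalCases": [100, 50],
    "ActiveCases": [30, 10],
    "TotalDeaths": [5, 5],
    "TotalRecovered": [65, 30],
})

# Cases that fall into one of the three known categories.
accounted = df["ActiveCases"] + df["TotalDeaths"] + df["TotalRecovered"]

# Positive values mean cases missing from the breakdown.
df["Missing"] = df["TotalCases"] - accounted
mismatched = df[df["Missing"] != 0]
```

Summing `mismatched["Missing"]` per region then shows how many cases are unaccounted for in each one.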
I use the divik package to have a consistent color theme for all palettes, no matter their length.
This palette provides colors for:
I make a palette for one calendar year. As the pandemic continues, this will have to change to the form "Month 1", "Month 2" and so on, in the dataframes as well.
I chose some of the more closely watched countries, plus Poland, for comparison. They can be seen later on.
For comparison I also chose the group that resulted from hierarchical clustering in the Happiness Report notebook. The group can be seen later on.
This palette is for observing the structure of cases so far for a specified region: "Active", "Recovered" or "Death". It does not consist of exactly 3 elements, as the colors of a 3-element palette do not diverge as much as I wanted.
This palette is for the case structure including the state of the "Active" cases - "Mild" and "Critical". I do not use it instead of the previous one, as the relevant df does not have the date specified.
Palette for discretized tests/1M people.
Palettes for comparing epidemics part.
I use functions developed during previous visualizations.
I want to check if the mortality ratio structures are significantly different for specified testing groups.
First of all, we can notice that at this point some countries have recorded more tests than they have inhabitants. They obviously have the best ratio.
Unsurprisingly as well, the countries with no recorded tests have higher mortality.
There is only one country that has fewer than 1k reported tests, if any were reported at all, so it would not appear on the plots. I add it to the 10k group.
However, the tests to 1 million are moved to the penultimate place.
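The discretization of test counts can be sketched with `pandas.cut`. The bin edges, the labels and the values below are assumptions chosen for illustration; the notebook's actual groups may differ.

```python
import pandas as pd

# Hypothetical test counts; 0 means that no tests were recorded.
tests = pd.Series([0, 800, 5_000, 40_000, 2_000_000], name="TotalTests")

# Left-closed bins: the lone sub-1k country joins the lowest group.
bins = [0, 10_000, 100_000, 1_000_000, float("inf")]
labels = ["<10k", "10k-100k", "100k-1M", ">1M"]
groups = pd.cut(tests, bins=bins, labels=labels, right=False)

# Countries without any recorded tests get their own category.
groups = groups.cat.add_categories("no data")
groups[tests == 0] = "no data"
```

Merging a nearly empty bin into its neighbour keeps every group large enough to be visible on the plots.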
Here I gather the sums of Tests and Cases for each region and normalize them accordingly. Because of the previous observations, I will treat Inactive cases as the difference between Total and Active. This way Active and Inactive always sum to Total. However, Recoveries and Deaths will not sum up to the Inactive case number.
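A sketch of the regional aggregation, assuming pandas; the column names and numbers are made up. Inactive is derived as Total minus Active, so the two shares always add up to 1.

```python
import pandas as pd

# Toy country-level data (hypothetical values).
df = pd.DataFrame({
    "WHO Region": ["Europe", "Europe", "Americas"],
    "TotalCases": [100, 300, 1000],
    "ActiveCases": [40, 60, 500],
})

# Sum the counts within each region.
region = df.groupby("WHO Region")[["TotalCases", "ActiveCases"]].sum()

# Inactive is defined as the remainder, not as Deaths + Recovered.
region["Inactive"] = region["TotalCases"] - region["ActiveCases"]

# Normalize by each region's total so that large regions do not dominate.
shares = region[["ActiveCases", "Inactive"]].div(region["TotalCases"], axis=0)
```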
I change the form of the dataframes to cover only the days since the first case in specified group.
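Re-indexing by days since the first case can be sketched like this, assuming pandas. The columns `Date` and `Confirmed` and the values are hypothetical.

```python
import pandas as pd

# Toy daily records for two regions (hypothetical values).
daily = pd.DataFrame({
    "WHO Region": ["Europe"] * 3 + ["Africa"] * 3,
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-24"] * 2),
    "Confirmed": [0, 2, 5, 0, 0, 1],
})

# Keep only the days on which the group already has at least one case.
with_cases = daily[daily["Confirmed"] > 0].copy()

# Day 0 is the first day with a case, computed separately for each group.
first_day = with_cases.groupby("WHO Region")["Date"].transform("min")
with_cases["Day"] = (with_cases["Date"] - first_day).dt.days
```

After this step the curves of all groups start at day 0, regardless of the calendar date of their first case.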
I start with the regions - this way it will be possible to investigate the course of the pandemic in each WHO Region specified.
I normalize the data by the populations.
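A minimal sketch of the population normalization, assuming pandas. The population figures are made up, and the per-million scale is an assumption.

```python
import pandas as pd

cases = pd.DataFrame({
    "WHO Region": ["Europe", "Africa"],
    "Confirmed": [5000, 1200],
})

# Hypothetical populations, not real figures.
population = pd.Series({"Europe": 10_000_000, "Africa": 4_000_000})

# Cases per one million inhabitants.
cases["ConfirmedPerMillion"] = (
    cases["Confirmed"] / cases["WHO Region"].map(population) * 1_000_000
)
```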
I also want to compare some of the countries. To do so, I gather their info and follow the previous procedure.
Here I use the following countries:
When I first used the spotlight group, I had data until the end of July. Poland did not seem to fit the group, so to speak, so for reasons such as population I decided to choose other countries for comparison. As I had done some visualisations on the Happiness Report before, including a clustering on last year's data, I decided to reuse it here.
I want to compare how much of each case outcome category - fatal, recovered, mild and serious - comes from each region.
Of course, each group is heavily dominated by patients in the Americas, who make up more than 50%. We can see that there are more deaths and serious cases in Europe than in other parts of the world. South-East Asia, in turn, has many mild cases and recovered patients.
I use normalized data so that the Americas won't overpower the rest of the regions again. This time I look the other way around, at the structure within each region.
We can see that, compared with the other regions, the Americas have a higher ratio of active to inactive cases. We can also notice that the death percentage is highest in Europe.
It appears as if some cases are lost somewhere along the way in Europe.
Once again, we see the 3 countries that cross the 1M/1M testing threshold. What is more, most of these countries are from Europe and there are none from Africa or South-East Asia.
None from Africa or the Western Pacific. Many from the Americas.
Once again, no Western Pacific. We see the significant difference between first 3 countries and the rest.
South-East Asia seems to be coping pretty well. Yemen appears to be an outlier, and not only within its group: Sudan comes next, and its value is much lower. Many of the countries are from Western Europe specifically.
Most of these are small countries/regions.
None from Africa. As expected, the USA, India and Brazil are leading, with the USA having significantly more active cases.
I am using day_wise data for this plot.
We can see that the percentage of active cases goes up and down for the second time now. The second rise started at the beginning of March, with its peak in early April.
The pandemic reached Africa last. In most regions the number of active cases keeps growing. However, in the Western Pacific this growth relative to population is the smallest.
Just as expected based on the previous plot, the number of active cases keeps increasing.
Not normalized
The number of confirmed cases so far is changing most drastically in the Americas and South-East Asia.
Normalized
The situation in the Americas still looks the worst. However, the cases in Europe, relative to the region's population, are the second worst, unlike in the previous plot.
Not normalized
We can see a spurt in new cases for the Western Pacific on day 22 (counted for this region). Another large jump is noted on day 237 for South-East Asia. The jumps in the last few days are high.
Normalized
Peaks in Europe keep reaching the level of valleys in the Americas. The Western Pacific copes the best.
Not normalized
Normalized
Not normalized
We see the reason for the sudden jump in new cases in South-East Asia.
Normalized
Not normalized
Normalized
Not normalized
Normalized
I use the ISO codes to map the data onto a globe map.
For a long time (second half of March) there is a significantly higher number of active cases in China than in any other place. When the outbreak source starts to fade, we can see a spurt in Europe (several points) and the beginning of one in the US. The increases are not smooth. After that the rest of the world starts to be affected as well. Cases in Africa are the last to be noticed.
The countries seem to be grouping.
I decided to use the Happiness Score data once again - the information about GDP, trust in the government, family relations, solidarity of the society and health might give us more insight into the pandemic situation.
We can see that countries with higher GDP can afford more tests for their populations.
I will be comparing only the variable pairs I have not considered before.
Considering that I took the data from mid-September, it seems that the rise in new cases at this pace may not be as big as in the first months of the pandemic. It looks as if the growth was least concerning in May.
I decided to compare the current pandemic with some of the prior ones. I chose the Ebola outbreak from 2014, swine flu from 2009 and SARS from 2003.
day_wise is used.
As I want to observe the differences between the progress of each disease, I convert consecutive dates to days from the first record. I know that the outbreak started sometime in December or even earlier, so the record started about 2 months after the beginning of the disease.
Some sources date the first cases as far back as December 2013. This means that the record started almost 9 months after the outbreak. Therefore, I will only be able to compare the reactions and results after people decided that it was a real threat.
The swine flu was noticed pretty fast in comparison with Ebola - about 3 months after the first cases. However, as the WHO did not request further reports of individual cases, the data runs only to around the sixth month of the disease - it stops before the Ebola record starts. Therefore, a comparison of consecutive days after the outbreaks is impossible.
The epidemic is said to have started in November of the year before. This once again means quite a large gap between the actual beginning and the beginning of the record.
Here I collect the data from the first days of the record for each epidemic.
Here I collect the data from the first 70 days of the record for each epidemic - the swine flu dataset records only 73 days, so comparison of the last records in most cases would be wrong.
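Cutting every record to a common window can be sketched as below, assuming pandas and a `Day` column counted from each epidemic's first record. The epidemic labels and record lengths are placeholders, apart from the 73-day length mentioned above.

```python
import pandas as pd

# Placeholder records: one long series and one that is only 73 days long.
records = pd.DataFrame({
    "Epidemic": ["long_record"] * 100 + ["short_record"] * 73,
    "Day": list(range(100)) + list(range(73)),
})

# Keep days 0..69 so that every epidemic covers the same window.
first_70 = records[records["Day"] < 70]
days_per_epidemic = first_70.groupby("Epidemic").size()
```

Truncating to the shortest usable record avoids comparing day 73 of one epidemic with a much later day of another.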
Here I collect the data for the mortality ratio on the 70th day of the record for each epidemic. This will allow me to compare the world's medical reactions to the diseases.
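The mortality ratio is taken here as cumulative deaths divided by cumulative cases on the chosen day; that definition is my assumption about the calculation, and the numbers are made up rather than taken from any of the datasets.

```python
import pandas as pd

# Hypothetical cumulative counts on day 70 for two placeholder epidemics.
day_70 = pd.DataFrame({
    "Epidemic": ["epidemic_a", "epidemic_b"],
    "Cases": [1000, 200],
    "Deaths": [20, 100],
}).set_index("Epidemic")

# Share of recorded cases that ended in death by that day.
day_70["MortalityRatio"] = day_70["Deaths"] / day_70["Cases"]
```

A ratio computed this way depends on how many cases were detected at all, so it reflects testing and reporting as well as the disease itself.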
Alongside the previous mortality ratio, I will check if the world is learning how to deal with the serious cases.
We can see that COVID was more widespread across the world at the start of the record than Ebola and swine flu. Moreover, H1N1 had been recorded in only 2 countries - the USA and Mexico.
It seems like Ebola was much more disregarded. However, it seems like at this point - at the end of January - there should have been more cases of COVID noted in China alone.
COVID-19 killed significantly more people than the rest of the diseases in the first 70 days.
It is also much more contagious. However, we can already notice that for Ebola the number of cases is not much larger than the number of deaths. This suggests a high mortality ratio.
I check the mortality ratio after 70 days - assuming that the world in each case had the same time to assess the situation after realising the threat.
We see that the COVID-19 mortality ratio is much lower than Ebola's or SARS's. However, given the pace of contagion, the new pandemic should not be disregarded - the number of deaths is still significantly higher.
We can see that the ratio for COVID-19 drops significantly - this may suggest progress in the situation, unlike for ebola and sars.